In this notebook, I carry out an in-depth exploratory and descriptive analysis of the Superstore Dataset, a widely used dataset for understanding sales performance, customer behavior, and profitability based on various transactional and demographic attributes.
This phase of analysis is essential for uncovering sales trends, identifying key performance drivers, and gaining intuition about the dataset’s structure before applying any forecasting or optimization procedures. I examine the distribution of key numerical and categorical variables, investigate relationships between product features, customer segments, and geographical regions with sales and profit levels, and use visualizations to summarize insights. Particular focus is placed on sales and profit disparities across product categories, customer segments, geographical regions, and **shipping modes, helping lay a solid foundation for downstream business intelligence and stratision-making.
The analysis begins by importing essential Python libraries for data handling, numerical computation, visualization, and directory management:
pandas: For efficient manipulation, filtering, and aggregation of tabular data.
numpy: Provides support for large, multi-dimensional arrays and mathematical functions.
os: Facilitates interaction with the operating system, particularly for path management.
plotly.express: A high-level API for creating interactive, publication-quality visualizations.
plotly.io: Used for configuring Plotly renderers and saving figures.
scipy.stats: Provides statistical functions for hypothesis testing.
statsmodels.api & statsmodels.formula.api: For statistical modeling, including ANOVA.
seaborn: For creating informative and attractive statistical graphics, especially heatmaps.
matplotlib.pyplot: A plotting library used for saving seaborn plots.
To ensure reproducibility and organized storage, I programmatically create directories if they don't already exist for:
These directories will store intermediate and final outputs for reproducibility.
I load the cleaned version of the Superstore Dataset from the processed data directory into a Pandas DataFrame. The head(10) function shows the first ten records, giving a glimpse into the data columns such as Order ID, Product Name, Sales, etc.
| Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Product Name | Sales | Quantity | Discount | Profit | Returned | Person | Shipping Duration | Order Year | Order Month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2016-152156 | 2016-11-08 | 2016-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | Bush Somerset Collection Bookcase | 261.9600 | 2 | 0.00 | 41.9136 | No | Cassandra Brandow | 3 | 2016 | 11 |
| 1 | 2 | CA-2016-152156 | 2016-11-08 | 2016-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.9400 | 3 | 0.00 | 219.5820 | No | Cassandra Brandow | 3 | 2016 | 11 |
| 2 | 3 | CA-2016-138688 | 2016-06-12 | 2016-06-16 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | ... | Self-Adhesive Address Labels for Typewriters b... | 14.6200 | 2 | 0.00 | 6.8714 | No | Anna Andreadi | 4 | 2016 | 6 |
| 3 | 4 | US-2015-108966 | 2015-10-11 | 2015-10-18 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | Bretford CR4500 Series Slim Rectangular Table | 957.5775 | 5 | 0.45 | -383.0310 | No | Cassandra Brandow | 7 | 2015 | 10 |
| 4 | 5 | US-2015-108966 | 2015-10-11 | 2015-10-18 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | Eldon Fold 'N Roll Cart System | 22.3680 | 2 | 0.20 | 2.5164 | No | Cassandra Brandow | 7 | 2015 | 10 |
| 5 | 6 | CA-2014-115812 | 2014-06-09 | 2014-06-14 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | ... | Eldon Expressions Wood and Plastic Desk Access... | 48.8600 | 7 | 0.00 | 14.1694 | No | Anna Andreadi | 5 | 2014 | 6 |
| 6 | 7 | CA-2014-115812 | 2014-06-09 | 2014-06-14 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | ... | Newell 322 | 7.2800 | 4 | 0.00 | 1.9656 | No | Anna Andreadi | 5 | 2014 | 6 |
| 7 | 8 | CA-2014-115812 | 2014-06-09 | 2014-06-14 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | ... | Mitel 5320 IP Phone VoIP phone | 907.1520 | 6 | 0.20 | 90.7152 | No | Anna Andreadi | 5 | 2014 | 6 |
| 8 | 9 | CA-2014-115812 | 2014-06-09 | 2014-06-14 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | ... | DXL Angle-View Binders with Locking Rings by S... | 18.5040 | 3 | 0.20 | 5.7825 | No | Anna Andreadi | 5 | 2014 | 6 |
| 9 | 10 | CA-2014-115812 | 2014-06-09 | 2014-06-14 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | ... | Belkin F5C206VTEL 6 Outlet Surge | 114.9000 | 5 | 0.00 | 34.4700 | No | Anna Andreadi | 5 | 2014 | 6 |
10 rows × 26 columns
This section provides summary statistics for both numerical and categorical variables,
Provides basic statistics (mean, std, min, max) for numerical columns like Sales, Profit, etc.
Summarizes frequency distributions of categorical columns like Region, Category, and Segment.
Here, I examine the structure of the dataset:
Sales, Profit, Quantity) and categorical variables (e.g., Category, Region, Segment).Understanding data types and null entries is essential before proceeding with analysis.
Dataset Dimensions (Rows, Columns): (9994, 26)
Dataset Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 9994 entries, 0 to 9993 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Row ID 9994 non-null int64 1 Order ID 9994 non-null object 2 Order Date 9994 non-null object 3 Ship Date 9994 non-null object 4 Ship Mode 9994 non-null object 5 Customer ID 9994 non-null object 6 Customer Name 9994 non-null object 7 Segment 9994 non-null object 8 Country 9994 non-null object 9 City 9994 non-null object 10 State 9994 non-null object 11 Postal Code 9994 non-null int64 12 Region 9994 non-null object 13 Product ID 9994 non-null object 14 Category 9994 non-null object 15 Sub-Category 9994 non-null object 16 Product Name 9994 non-null object 17 Sales 9994 non-null float64 18 Quantity 9994 non-null int64 19 Discount 9994 non-null float64 20 Profit 9994 non-null float64 21 Returned 9994 non-null object 22 Person 9994 non-null object 23 Shipping Duration 9994 non-null int64 24 Order Year 9994 non-null int64 25 Order Month 9994 non-null int64 dtypes: float64(3), int64(6), object(17) memory usage: 2.0+ MB
This summary provides a snapshot of key distribution characteristics.
Summary statistics for numerical variables:
| Row ID | Postal Code | Sales | Quantity | Discount | Profit | Shipping Duration | Order Year | Order Month | |
|---|---|---|---|---|---|---|---|---|---|
| count | 9994.000000 | 9994.000000 | 9994.000000 | 9994.000000 | 9994.000000 | 9994.000000 | 9994.000000 | 9994.000000 | 9994.000000 |
| mean | 4997.500000 | 55190.379428 | 229.858001 | 3.789574 | 0.156203 | 28.656896 | 3.958175 | 2015.722233 | 7.809686 |
| std | 2885.163629 | 32063.693350 | 623.245101 | 2.225110 | 0.206452 | 234.260108 | 1.747567 | 1.123555 | 3.284654 |
| min | 1.000000 | 1040.000000 | 0.444000 | 1.000000 | 0.000000 | -6599.978000 | 0.000000 | 2014.000000 | 1.000000 |
| 25% | 2499.250000 | 23223.000000 | 17.280000 | 2.000000 | 0.000000 | 1.728750 | 3.000000 | 2015.000000 | 5.000000 |
| 50% | 4997.500000 | 56430.500000 | 54.490000 | 3.000000 | 0.200000 | 8.666500 | 4.000000 | 2016.000000 | 9.000000 |
| 75% | 7495.750000 | 90008.000000 | 209.940000 | 5.000000 | 0.200000 | 29.364000 | 5.000000 | 2017.000000 | 11.000000 |
| max | 9994.000000 | 99301.000000 | 22638.480000 | 14.000000 | 0.800000 | 8399.976000 | 7.000000 | 2017.000000 | 12.000000 |
This summary provides insights into the distribution and most frequent categories for the object (categorical) variables in the dataset.
Summary statistics for categorical variables:
| Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | State | Region | Product ID | Category | Sub-Category | Product Name | Returned | Person | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 | 9994 |
| unique | 5009 | 1237 | 1334 | 4 | 793 | 793 | 3 | 1 | 531 | 49 | 4 | 1862 | 3 | 17 | 1850 | 2 | 4 |
| top | CA-2017-100111 | 2016-09-05 | 2015-12-16 | Standard Class | WB-21850 | William Brown | Consumer | United States | New York City | California | West | OFF-PA-10001970 | Office Supplies | Binders | Staple envelope | No | Anna Andreadi |
| freq | 14 | 38 | 35 | 5968 | 37 | 37 | 5191 | 9994 | 915 | 2001 | 3203 | 19 | 6026 | 1523 | 48 | 9194 | 3203 |
Interpretation of Categorical Summary Statistics:
Understanding the distribution of key categorical variables provides crucial insights into the Superstore's operational landscape and customer base.
Shows proportions (in %) of entries per Region, Category, and Segment.
Distribution of 'Region': West 3203 East 2848 Central 2323 South 1620 Name: Region, dtype: int64 Normalized Distribution of 'Region': West 32.05% East 28.5% Central 23.24% South 16.21% Name: Region, dtype: object
Distribution of 'Category': Office Supplies 6026 Furniture 2121 Technology 1847 Name: Category, dtype: int64 Normalized Distribution of 'Category': Office Supplies 60.3% Furniture 21.22% Technology 18.48% Name: Category, dtype: object
Distribution of 'Segment': Consumer 5191 Corporate 3020 Home Office 1783 Name: Segment, dtype: int64 Normalized Distribution of 'Segment': Consumer 51.94% Corporate 30.22% Home Office 17.84% Name: Segment, dtype: object
This section focuses on in-depth Exploratory Data Analysis (EDA) through various visualizations.
This line chart displays seasonal sales patterns throughout the year, revealing distinct peaks and valleys in business performance.
Total Sales by Month: Order Month Sales 0 January 94924.8356 1 February 59751.2514 2 March 205005.4888 3 April 137762.1286 4 May 155028.8117 5 June 152718.6793 6 July 147238.0970 7 August 159044.0630 8 September 307649.9457 9 October 200322.9847 10 November 352461.0710 11 December 325293.5035
This correlation heatmap displays the relationships between key business metrics: Sales, Quantity, Discount, Profit, and Shipping Duration, with correlation coefficients ranging from -1 to +1. Key Observations:
This heatmap displays the contingency table showing the relationship between shipping methods and return status, with statistical significance confirmed by a Chi-squared test (χ² = 22.95, p < 0.001).
Contingency Table:
No Yes
First Class 1386 152
Same Day 479 64
Second Class 1811 134
Standard Class 5518 450
Chi-squared Statistic: 22.95
P-value: 0.000
Degrees of Freedom: 3
This ANOVA table shows the statistical analysis of profit differences across product categories (Furniture, Office Supplies, Technology) using one-way ANOVA.
Hypothesis: Is there a significant difference in profit across different product categories? - Null Hypothesis : The mean profit is the same across all product categories. - Alternative Hypothesis : At least one product category has a different mean profit. ANOVA Table for Profit by Category:
| sum_sq | df | F | PR(>F) | |
|---|---|---|---|---|
| C(Category) | 5.898009e+06 | 2.0 | 54.311023 | 3.469918e-24 |
| Residual | 5.424958e+08 | 9991.0 | NaN | NaN |
This independent samples t-test compares profit levels between discounted items (discount > 0%) and non-discounted items (discount = 0%) to determine if discounting significantly impacts profitability. Key Observations:
Hypothesis: Is there a significant difference in profit between items sold with a discount and items sold without a discount? - Null Hypothesis : The mean profit for discounted items is equal to the mean profit for non-discounted items. - Alternative Hypothesis : The mean profit for discounted items is different from the mean profit for non-discounted items. T-test: Profit difference by Discount status T-statistic: -15.737992941015493 P-value: 4.356930371141414e-55
This contingency table and chi-square test analyze the relationship between geographic regions and return rates, with statistical significance confirmed by χ² = 343.57 and p < 0.001.
Contingency Table (Region vs. Returned): Returned No Yes Region Central 2231 92 East 2699 149 South 1551 69 West 2713 490 Chi-square statistic: 343.5657 P-value for Region vs. Returned: 0.0000 Conclusion: Since the p-value (0.0000) is less than 0.05, we reject the null hypothesis. There is a statistically significant association between 'Region' and whether an item is 'Returned'. This suggests that return rates may vary by region, warranting further investigation into regional customer satisfaction or product quality issues.
This scatter plot illustrates the relationship between discount rate and profit, with points colored by product Category (Furniture, Office Supplies, Technology).
Higher discounts generally correlate with lower profits, particularly for Office Supplies and Technology.
Many points with discounts above 40% fall into the negative profit range, indicating possible over-discounting.
Some Technology items show high profits at 0% discount, emphasizing that discounts may not be necessary for profitable sales in this category.
Furniture items seem more stable but also tend to lose profitability with increasing discounts.
Re-evaluate the discounting strategy, especially for high-loss categories like Office Supplies.
Consider implementing a discount cap policy to avoid over-discounting.
Use targeted discounting rather than blanket discounts to preserve profitability while still driving sales.
These bar charts show total sales and profit by product sub-category.
Phones and Chairs have the highest sales, but Chairs have lower profit contribution.
Copiers generate the highest profit, even with moderate sales.
Tables, Bookcases, and Supplies generate losses, despite decent sales volumes.
Accessories and Binders show healthy profit-to-sales alignment.
Boost sales of high-profit items like Copiers and Accessories.
Reassess or reduce focus on Tables, Bookcases, and Supplies due to consistent losses.
Monitor Chairs and Appliances profitability—optimize pricing or reduce costs.
Align marketing and discounting strategy to maximize profitability, not just sales.
Insights from the Boxplot:
All three segments show a similar spread and central tendency of profit values.
There are outliers in each group (e.g., high profits and losses), but the overall shapes of the distributions are comparable.
This visually supports the ANOVA finding: segment-based targeting may not drive significantly different profits.
--- ANOVA Results: Profit by Segment --- F-statistic: 0.90 P-value: 4.074e-01
This bar chart shows total sales distribution across four shipping modes, with Standard Class significantly dominating the sales volume.
This bar chart displays the total quantity of products sold across three customer segments, revealing clear volume differences between segments.
Total Quantity by Segment:
Segment Quantity
0 Consumer 19521
1 Corporate 11608
2 Home Office 6744
This pie chart shows the profit contribution of three product categories, with Technology leading as the primary profit driver.
Total Profit by Category:
Category Profit
0 Furniture 18451.2728
1 Office Supplies 122490.8008
2 Technology 145454.9481
This bar chart reveals regional sales distribution across four geographic areas, showing significant variation in market performance.
Total Sales by Region:
Region Sales
3 West 725457.8245
1 East 678781.2400
0 Central 501239.8908
2 South 391721.9050